{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Julia for Data Science - Algorithms\n",
    "\n",
    "### Data processing\n",
    "In what's next, we will see how to use some of the standard algorithms for data analysis implemented in Julia. In particular, we'll look at\n",
    "\n",
    "1. Kmeans clustering\n",
    "1. Nearest neighbors with a KDTree\n",
    "1. PCA for dimensionality reduction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using DataFrames, Statistics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll be using the same data for all three of these examples -- the Sacramento real estate transactions file that we download next. This is a list of 985 real estate transactions in the Sacramento area reported over a five-day period."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "download(\"http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv\",\"houses.csv\")\n",
    "houses = readtable(\"houses.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's use `Plots` to plot with the `pyplot` backend and start familiarizing ourselves with this data set!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using Plots\n",
    "pyplot()\n",
    "plot(size=(500,500),leg=false)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's create a scatter plot to show the price of a house vs. its square footage,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = houses[!, :sq__ft]\n",
    "# x = houses[7] # equivalent, useful if file has no header\n",
    "y = houses[!, :price]\n",
    "# y = houses[10] # equivalent\n",
    "scatter(x,y,markersize=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Houses with 0 square feet that cost money?*\n",
    "\n",
    "The square footage seems to not have been recorded in these cases. \n",
    "\n",
    "Filtering these houses out is easy to do!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "filter_houses = houses[houses[!, :sq__ft] .> 0, :]  # dot broadcasting\n",
    "x = filter_houses[!, :sq__ft]\n",
    "y = filter_houses[!, :price]\n",
    "scatter(x,y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This makes sense! The higher the square footage, the higher the price."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can filter a `DataFrame` by feature value too, using the `by` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "by(filter_houses,:type,size)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "by(filter_houses,:type,filter_houses->mean(filter_houses[!, :price]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example 1: Kmeans Clustering\n",
    "\n",
    "Now let's do some kmeans clustering on this data.\n",
    "\n",
    "First, we can load the `Clustering` package to do this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Pkg.add(\"Clustering\")\n",
    "using Clustering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's store the features `:latitude` and `:longitude` in an array `X` that we will pass to `kmeans`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = filter_houses[!, [:latitude,:longitude]]\n",
    "X = Array{Float64}(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each feature is stored as a row of `X`, but we can transpose to make these features columns of `X`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = transpose(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's plot longitudes vs. latitudes for this data!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "location_figure = scatter(X[1, :], X[2, :])\n",
    "# Alternatively....\n",
    "# location_figure = scatter(filter_houses[!, :latitude], filter_houses[!, :longitude])\n",
    "xlabel!(\"Latitude\")\n",
    "ylabel!(\"Longitude\")\n",
    "title!(\"Houses plotted by location\")\n",
    "display(location_figure)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We want to identify clusters in the data above. As a first pass at guessing how many clusters we might need, let's use the number of zip codes in our data.\n",
    "\n",
    "(Try changing this to see how it impacts results!)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "k = length(unique(filter_houses[!, :zip])) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use the `kmeans` function to do kmeans clustering!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "C = kmeans(X,k) # try changing k"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's create a new data frame, `df`, with all the same data as `filter_houses` that also includes a column for the cluster to which each house has been assigned."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = DataFrame(cluster = C.assignments,city = filter_houses[!,:city],\n",
    "    latitude = filter_houses[!,:latitude],longitude = filter_houses[!,:longitude],zip = filter_houses[!,:zip])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's plot each cluster as a different color."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clusters_figure = plot()\n",
    "for i = 1:k\n",
    "    # filter df to grab all houses in the ith cluster\n",
    "    clustered_houses = df[df[!,:cluster].== i,:]\n",
    "    # grab latitudes and longitudes of all houses in the ith cluster\n",
    "    xvals = clustered_houses[!,:latitude]\n",
    "    yvals = clustered_houses[!,:longitude]\n",
    "    # plot latitudes and longitudes of all houses in the ith cluster\n",
    "    scatter!(clusters_figure,xvals,yvals,markersize=4)\n",
    "end\n",
    "xlabel!(\"Latitude\")\n",
    "ylabel!(\"Longitude\")\n",
    "title!(\"Houses color-coded by cluster\")\n",
    "display(clusters_figure)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now let's try coloring them by zip code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "unique_zips = unique(filter_houses[!,:zip])\n",
    "zips_figure = plot()\n",
    "for uzip in unique_zips\n",
    "    # filter houses by zipcode\n",
    "    subs = filter_houses[filter_houses[!,:zip].==uzip,:]\n",
    "    # grab the latitudes and longitudes of all houses in a given zipcode/subdivision\n",
    "    x = subs[!, :latitude]\n",
    "    y = subs[!, :longitude]\n",
    "    # plot the houses in this zipcode by latitude and longitude!\n",
    "    scatter!(zips_figure,x,y)\n",
    "end\n",
    "xlabel!(\"Latitude\")\n",
    "ylabel!(\"Longitude\")\n",
    "title!(\"Houses color-coded by zip code\")\n",
    "display(zips_figure)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see the two plots side by side."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot(clusters_figure,zips_figure,layout=(2, 1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Not exactly! but almost... Now we know that ZIP codes are not randomly assigned!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example 2: Nearest Neighbor with a KDTree\n",
    "\n",
    "For this example, let's start by loading the `NearestNeighbors` package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using NearestNeighbors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With this package, we'll look for the `knearest` neighbors of one of the houses, `point`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "knearest = 10\n",
    "id = 70 # try changing this\n",
    "point = X[:,id]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can build a `KDTree` and use `knn` to look for `point`'s nearest neighbors!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "kdtree = KDTree(X)\n",
    "idxs, dists = knn(kdtree, point, knearest, true)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll first generate a plot with all of the houses in the same color,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = filter_houses[!,:latitude];\n",
    "y = filter_houses[!,:longitude];\n",
    "scatter(x,y);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and then overlay the data corresponding to the nearest neighbors of `point` in a different color."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = filter_houses[idxs,:latitude];\n",
    "y = filter_houses[idxs,:longitude];\n",
    "scatter!(x,y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are those nearest neighbors in red!\n",
    "\n",
    "We can see the cities of the neighboring houses by using the indices, `idxs`, and the feature, `:city`, to index into the `DataFrame` `filter_houses`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cities = filter_houses[idxs,:city]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example 3: PCA for dimensionality reduction\n",
    "\n",
    "Let us try to reduce the dimensions of the price/area data from the houses dataset.\n",
    "\n",
    "We can start by grabbing the square footage and prices of the houses and storing them in an `Array`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "F = filter_houses[[:sq__ft,:price]]\n",
    "F = convert(Array{Float64,2},F)'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Recall how the data looks when we plot housing prices against square footage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scatter(F[1,:],F[2,:])\n",
    "xlabel!(\"Square footage\")\n",
    "ylabel!(\"Housing prices\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use the `MultivariateStats` package to run PCA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pkg.add(\"MultivariateStats\")\n",
    "using MultivariateStats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use `fit` to fit the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "M = fit(PCA, F)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that you can choose the maximum dimension of the new space by setting `maxoutdim`, and you can change the method to, for example, `:svd` with the following syntax.\n",
    "\n",
    "```julia\n",
    "fit(PCA, F; maxoutdim = 1,method=:svd)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It seems like we only get one dimension with PCA! Let's use `transform` to map all of our 2D data in `F` to `1D` data with our model, `M`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = transform(M, F)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's use `reconstruct` to put our now 1D data, `y`, in a form that we can easily overlay (`Xr`) with our 2D data in `F` along the principle direction/component."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Xr = reconstruct(M, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now we create that overlay, where we can see points along the principle component in red. \n",
    "\n",
    "(Each blue point maps uniquely to some red point!)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scatter(F[1,:],F[2,:])\n",
    "scatter!(Xr[1,:],Xr[2,:])\n",
    "xlabel!(\"Square footage\")\n",
    "ylabel!(\"Housing prices\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Julia 1.0.5",
   "language": "julia",
   "name": "julia-1.0"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "1.0.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
